14-logistic-regression

Author

Professor Shannon Ellis

Published

February 28, 2023

Logistic Regression

Q&A

Q: Understanding the meaning behind the heat maps. How is it collinear regarding race and not age? What would collinearity look like if it were correlated with age?
A: Great question - we’re going to continue with this today!

Q: Are we allowed to use the analysis we use in class, for our analysis for the case study? As long as we understand it, can we just copy and paste?
A: Kind of. What we discuss in class will help answer Q2 about collinearity…and will get you well on your way to answering Q1 about the relationship between RTC laws and violent crime. So, yes, you can copy+paste anything from class that you want, but you’ll need to additionally make some decisions to fully answer the questions.

Q: I am confused about if our group accidentally works on the same lines/parts of the file for cs01, will our work be overwritten by those who push later? I’m not familiar to GitHub so I may need to explain more.
A: If you work on the exact same lines, yes, this will cause a “merge conflict” …so it will not silently overwrite what someone else did…but it will force you to decide whose version you want to keep. Best solution is to not work on the same parts of the file at the same time.

[ad/gross self-promotion?] Last Lecture Wed @ 5pm

Last Lecture: Life Lessons That Have Nothing to Do with Data or Science

A UCSD Data Science Education will teach you a lot. There will be programming, data, dataviz, statistics, machine learning, linear algebra, ethics, capstone projects, and domain knowledge galore. But, these courses will not teach you the very specific lessons that Prof Ellis has learned along her journey. Come hear the advice that took Prof Ellis decades to receive, their surrounding stories, and the lessons she hopes you learn faster than she did in one jam-packed chat.

The Last Lecture Series is a huge opportunity for students to gain some insight about a Professor’s journey and the obstacles they had to overcome to get to where they are today, especially coming from professors who have reached success in the field of data science. We highly encourage you to attend!

This event will be happening at 5pm on Wednesday, 3/1.

We’ll be hosting this at the SDSC Auditorium! Registration is no longer needed.

Course Announcements

Due Dates:

  • Lecture Participation survey “due” after class
  • Lab07 due Friday (3/3; 11:59 PM)
  • CS01 due Mon (3/6; 11:59 PM)

Notes:

  • lab05 and lab06 scores posted
  • lab07 will be posted later today (wanted to start the material before posting lab)
  • cs01 team willing to take on an additional member?

Agenda

  • Lab06 Review
  • Logistic Regression
    • Single predictor
    • Multiple predictors
  • Model evaluation

Lab 06: EDA

EDA Example #1: Shenova

ggplot(DONOHUE_DF, aes(y=Viol_crime_count, x=YEAR)) + 
  geom_line() + 
  labs(
    title = "Violent Crime Rate by State and Year", 
    x = "Year",
    y = "Total Violent Crime Rate") +
  
  facet_wrap(~STATE, nrow = 5)+ 
  theme(axis.text.x = element_text(angle = 90), plot.title.position = "plot")

EDA Example #2

p2 <- DONOHUE_DF |>
  distinct(STATE, RTC_LAW_YEAR) |>  # one row per state
  ggplot(aes(x=RTC_LAW_YEAR)) +
  geom_bar() +
  scale_x_continuous(
    breaks = seq(1980, 2015, by = 1)
  ) +
  labs(
    title = "Distribution of RTC Law Years",
    x = "RTC Law Year", y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), plot.title.position = "plot")
p2

EDA Example #3: Sebastian

library(maps)

# load state map data
states_map <- map_data("state")

# merge state map data with DONOHUE_DF data
DONOHUE_DF_map <- merge(states_map, DONOHUE_DF, by.x = "region", by.y = "STATE")

# plot map using ggplot2
ggplot() +
  geom_polygon(
    data = DONOHUE_DF_map,
    aes(
      x = long, y = lat, group = group,
      fill = cut(Viol_crime_rate_1k,
                 breaks = c(0, 3, 4, 5, 6, 7, 8, 9, 10),
                 labels = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"))
    ),
    color = "white", size = 0.1
  ) +
  coord_fixed() +
  theme_void() +
  scale_fill_brewer(name = "Violent Crime Rate per 1k", palette = "YlOrRd", na.value = "white",
                    labels = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"),
                    breaks = c("0-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10")) +
  labs(title = "Southern States Have the Highest Violent Crime Rates",
       subtitle = "Violent Crime Rates per 1000 in Each State",
       caption = "Note: White areas indicate missing data") +
  theme(plot.title = element_text(size = 15, face = "bold"),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 8, hjust = 0),
        legend.position = "bottom",
        legend.title.align = 0.5,
        legend.text = element_text(size = 6),
        legend.title = element_text(size = 8))

Lab06: Brainstorming

  • Focusing in on a variable in the data (i.e. poverty, # of police officers, etc.) and answering more detailed questions about it (in relation to main question)
  • Focusing in on a demographic subset (i.e. specific race or specific age group) and asking the questions posed within that group specifically
  • Analyzing with a focus on geographic differences (i.e. border states, specific states - be sure to explain decision)
  • Focusing on a different aspect of guns (i.e. accidental fatalities, specific types of crimes, specific type of carry, etc.)
  • Focus in and go deep on a specific period of time
  • Studying the relationship between gun laws and violent crime rates in other countries
  • Consider a “new” but related variable (i.e. “overpolicing”)

Predicting categorical data

Spam filters

  • Data from 3921 emails and 21 variables on them
  • Outcome: whether the email is spam or not
  • Predictors: number of characters, whether the email had “Re:” in the subject, time at which email was sent, number of times the word “inherit” shows up in the email, etc.
library(openintro)
glimpse(email)
Rows: 3,921
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email   <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time         <dttm> 2011-12-31 22:16:41, 2011-12-31 23:03:59, 2012-01-01 08:…
$ image        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner       <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
$ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks  <int> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format       <fct> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number       <fct> big, small, small, small, none, none, big, small, small, …

❓ Would you expect longer or shorter emails to be spam??

# A tibble: 2 × 2
  spam  mean_num_char
  <fct>         <dbl>
1 0             11.3 
2 1              5.44

❓ Would you expect emails that have subjects starting with “Re:”, “RE:”, “re:”, or “rE:” to be spam or not?

Modelling spam

  • Both number of characters and whether the message has “re:” in the subject might be related to whether the email is spam. How do we come up with a model that will let us explore this relationship?
  • For simplicity, we’ll focus on the number of characters (num_char) as predictor, but the model we describe can be expanded to take multiple predictors as well.

Modelling spam

This isn’t something we can reasonably fit a linear model to – we need something different!

Framing the problem

  • We can treat each outcome (spam and not) as successes and failures arising from separate Bernoulli trials
    • Bernoulli trial: a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted
  • Each Bernoulli trial can have a separate probability of success

\[ y_i \sim \text{Bern}(p_i) \]

  • We can then use the predictor variables to model that probability of success, \(p_i\)
  • We can’t just use a linear model for \(p_i\) (since \(p_i\) must be between 0 and 1) but we can transform the linear model to have the appropriate range

Generalized linear models

  • This is a very general way of addressing many problems in regression and the resulting models are called generalized linear models (GLMs)
  • Logistic regression is just one example

Three characteristics of GLMs

All GLMs have the following three characteristics:

  1. A probability distribution describing a generative model for the outcome variable
  2. A linear model: \[\eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
  3. A link function that relates the linear model to the parameter of the outcome distribution

Logistic regression

  • Logistic regression is a GLM used to model a binary categorical outcome using numerical and categorical predictors
  • To finish specifying the logistic model we just need to define a reasonable link function that connects \(\eta_i\) to \(p_i\): the logit function
  • Logit function: For \(0 < p < 1\)

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

Logit function, visualised

Properties of the logit

  • The logit function takes a value between 0 and 1 and maps it to a value between \(-\infty\) and \(\infty\)
  • Inverse logit (logistic) function: \[g^{-1}(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]
  • The inverse logit function takes a value between \(-\infty\) and \(\infty\) and maps it to a value between 0 and 1
  • This formulation is also useful for interpreting the model, since the logit can be interpreted as the log odds of a success – more on this later
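The properties above can be checked numerically. This is a minimal sketch: `logit()` and `inv_logit()` are helper names defined here for illustration, and the probabilities in `p` are arbitrary.

```r
# Logit: maps a probability in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

# Inverse logit (logistic): maps any real number back into (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
x <- logit(p)
x  # negative below p = 0.5, zero at p = 0.5, positive above

# The round trip returns the original probabilities
all.equal(inv_logit(x), p)

# Base R provides the same pair as qlogis() and plogis()
all.equal(logit(p), qlogis(p))
all.equal(inv_logit(x), plogis(x))
```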

The logistic regression model

  • Based on the three GLM criteria we have
    • \(y_i \sim \text{Bern}(p_i)\)
    • \(\eta_i = \beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}\)
    • \(\text{logit}(p_i) = \eta_i\)
  • From which we get

\[p_i = \frac{\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1+\exp(\beta_0+\beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}\]
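One way to see the model as a generative process is to simulate from it and then fit it. The sketch below uses simulated data with hypothetical "true" coefficients (not the email data) and base R's `glm()` rather than the tidymodels interface.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n     <- 2000
beta0 <- -1     # hypothetical "true" coefficients chosen for illustration
beta1 <- 0.5

x   <- runif(n, min = -3, max = 3)    # simulated predictor, not the email data
eta <- beta0 + beta1 * x              # linear predictor eta_i
p   <- exp(eta) / (1 + exp(eta))      # p_i from the inverse logit
y   <- rbinom(n, size = 1, prob = p)  # y_i ~ Bern(p_i)

sim_fit <- glm(y ~ x, family = binomial)
coef(sim_fit)            # estimates of beta0 and beta1

# Fitted probabilities always fall strictly between 0 and 1
range(fitted(sim_fit))
```

With a reasonably large sample the estimates should land near the values used to generate the data, though the exact numbers depend on the seed.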

Modeling spam

In R we fit a GLM in the same way as a linear model except we:

  • specify the model with logistic_reg()
  • use "glm" instead of "lm" as the engine
  • define family = "binomial" for the link function to be used in the model
spam_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ num_char, data = email, family = "binomial")

tidy(spam_fit)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -1.80     0.0716     -25.1  2.04e-139
2 num_char     -0.0621   0.00801     -7.75 9.50e- 15

Spam model

Model:

\[ \log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times \text{num_char} \]

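P(spam) for an email with 2000 characters

num_char is recorded in thousands of characters, so a 2000-character email has num_char = 2:

\[\log\left(\frac{p}{1-p}\right) = -1.80-0.0621\times 2 = -1.9242\]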

\[\frac{p}{1-p} = \exp(-1.9242) = 0.15 \rightarrow p = 0.15 \times (1 - p)\]

\[p = 0.15 - 0.15p \rightarrow 1.15p = 0.15\]

\[p = 0.15 / 1.15 = 0.13\]

❓ What is the probability that an email with 15000 characters is spam? What about an email with 40000 characters?

  • 2K chars: P(spam) = 0.13
  • 15K chars: P(spam) = 0.06
  • 40K chars: P(spam) = 0.01
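These probabilities can be reproduced in a few lines of R. The sketch below copies the rounded coefficients from the `tidy(spam_fit)` output; `plogis()` is base R's inverse logit, and the variable names are just for illustration.

```r
# Coefficients copied from the tidy(spam_fit) output
b0 <- -1.80
b1 <- -0.0621

# num_char is recorded in thousands of characters
num_char <- c(2, 15, 40)

log_odds <- b0 + b1 * num_char
p_spam   <- plogis(log_odds)   # inverse logit: exp(x) / (1 + exp(x))

round(p_spam, 2)
# 0.13 0.06 0.01
```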

❓ Would you prefer an email with 2000 characters to be labelled as spam or not? How about 40,000 characters?

Sensitivity and Specificity

False positive and negative

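|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |
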
  • False negative rate = P(Labelled not spam | Email spam) = FN / (TP + FN)

  • False positive rate = P(Labelled spam | Email not spam) = FP / (FP + TN)

Sensitivity and Specificity

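|                         | Email is spam                 | Email is not spam             |
|-------------------------|-------------------------------|-------------------------------|
| Email labelled spam     | True positive                 | False positive (Type 1 error) |
| Email labelled not spam | False negative (Type 2 error) | True negative                 |
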
  • Sensitivity = P(Labelled spam | Email spam) = TP / (TP + FN)
    • Sensitivity = 1 − False negative rate
  • Specificity = P(Labelled not spam | Email not spam) = TN / (FP + TN)
    • Specificity = 1 − False positive rate
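A small numeric example may help make the definitions concrete. The counts below are hypothetical, made up purely for illustration, and do not come from the email data.

```r
# Hypothetical confusion-matrix counts, made up for illustration
TP <- 40   # spam, labelled spam
FN <- 10   # spam, labelled not spam
FP <- 5    # not spam, labelled spam
TN <- 95   # not spam, labelled not spam

sensitivity <- TP / (TP + FN)
specificity <- TN / (FP + TN)

false_negative_rate <- FN / (TP + FN)
false_positive_rate <- FP / (FP + TN)

c(sensitivity = sensitivity, specificity = specificity)

# The identities from the bullets hold
all.equal(sensitivity, 1 - false_negative_rate)
all.equal(specificity, 1 - false_positive_rate)
```

With these counts the filter catches 80% of spam (sensitivity) and correctly passes 95% of legitimate email (specificity).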

❓ If you were designing a spam filter, would you want sensitivity and specificity to be high or low? What are the trade-offs associated with each decision?

Modeling Spam: Multiple predictors

spam_mult <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ num_char + to_multiple + re_subj, data = email, family = "binomial")

tidy(spam_mult)
# A tibble: 4 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -1.20     0.0752     -16.0  2.21e-57
2 num_char      -0.0686   0.00781     -8.78 1.57e-18
3 to_multiple1  -2.14     0.299       -7.18 6.92e-13
4 re_subj1      -3.12     0.360       -8.66 4.70e-18

Model: Multiple predictors

\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) &= - 1.20 - 0.07 \times \texttt{num_char} \\ &\quad - 2.14\times \texttt{to_multiple}_{\texttt{1}} \\ &\quad - 3.12 \times \texttt{re_subj}_{\texttt{1}} \\ \end{aligned} \]
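The fitted equation can be wrapped in a small function to get predicted probabilities for any combination of predictors. This is a sketch using the rounded coefficients from the equation; `p_spam()` is a hypothetical helper name, and the email described in the example call is invented for illustration.

```r
# Rounded coefficients from the model equation
p_spam <- function(num_char, to_multiple, re_subj) {
  log_odds <- -1.20 - 0.07 * num_char - 2.14 * to_multiple - 3.12 * re_subj
  plogis(log_odds)   # inverse logit turns log odds into a probability
}

# Hypothetical email: 10,000 characters, multiple recipients, no "re:" in the subject
p_spam(num_char = 10, to_multiple = 1, re_subj = 0)

# Holding the other predictors fixed, "re:" lowers the predicted probability
p_spam(10, 0, 1) < p_spam(10, 0, 0)
```

In practice `predict()` on the fitted model object would use the unrounded coefficients, so results computed this way can differ slightly in the later decimal places.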

Model: Multiple predictors

So for an email with 4,000 characters (4), addressed to a single recipient (0), and that did start with “re:” in the subject line (1)…

\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) = - 1.20 - 0.07 \times 4 - 2.14\times 0 - 3.12 \times 1 \end{aligned} \]

Model: Multiple predictors

\[ \begin{aligned} \log_e \left(\frac{p}{1 - p}\right) = - 1.20 - 0.28 - 0 - 3.12 = - 4.60 \end{aligned} \]

…solve for \(\widehat{p}\)

\[ \begin{aligned} \widehat{p} = \frac{e^{-4.60}}{1 + e^{-4.60}} \approx 0.010 = 1.0\% \end{aligned} \]

About a 1% chance that such an email would be spam

Model Comparison

Akaike information criterion (AIC)

  • popular model selection method
  • estimator of prediction error
  • praised for its emphasis on model uncertainty and parsimony
  • In calculating AIC, a penalty is given for including additional variables. This penalty for added model complexity attempts to strike a balance between underfitting (too few variables in the model) and overfitting (too many variables in the model).
  • models with lower AIC values are considered to be “better.”

Comparing Models: AIC

Single predictor (num_char)

glance(spam_fit)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         2437.    3920 -1173. 2350. 2363.    2346.        3919  3921

Multiple predictors

glance(spam_mult)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         2437.    3920 -1032. 2071. 2097.    2063.        3917  3921
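To see where the penalty for extra variables enters, the AIC values can be rebuilt from the other columns. For a 0/1 outcome the residual deviance equals -2 × logLik, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below uses the rounded values printed by `glance()`; `aic_from_deviance()` is a helper defined here for illustration.

```r
# For a 0/1 outcome, the residual deviance equals -2 * logLik,
# so AIC = deviance + 2 * (number of estimated coefficients).
aic_from_deviance <- function(deviance, k) deviance + 2 * k

# Rounded values read off the glance() output
aic_single <- aic_from_deviance(deviance = 2346, k = 2)  # intercept + num_char
aic_mult   <- aic_from_deviance(deviance = 2063, k = 4)  # intercept + 3 predictors

c(single = aic_single, multiple = aic_mult)
# single: 2350, multiple: 2071

# The lower value identifies the preferred model
aic_mult < aic_single
```

The multiple-predictor model pays a larger penalty (8 versus 4) but its deviance drops by far more, so it has the lower AIC.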

Recap

  • Understand the relationship between linear and logistic regression
  • Carry out, explain, and interpret logistic regression with a single predictor and multiple predictors
  • Compare models using AIC

Suggested Reading

Introduction to Modern Statistics Chapter 9: Logistic Regression